Keyword [Word-CNN-RNN] [Character-CNN-RNN]

Reed S, Akata Z, Lee H, et al. Learning deep representations of fine-grained visual descriptions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 49-58.

1. Overview

1.1. Motivation

attribute-based zero shot

attribute do not provide a natural language interface
fine-grained recognition requires commensurately more attribute
natural language provides a flexible and compact way of encoding only the salient visual aspects for distinguishing categories

In this paper, it proposed word or character based methods

word-CNN-RNN
character-CNN-RNN

fine-grained classification
zero-shot
image and video caption

2. Methods

2.1. Embedding

2.1.1. Objective

2.1.2. Inference

compatibility
classifier

2.1.3. Learning

since 0-1 loss is discontinue

2.2. Text-based ConvNet

image width is 1 pixel, channel number is equal to the alphabet size

2.3. Convolutional Recurrent Net (CNN-RNN)

only CNN lacks a strong temporal dependency along the input text sequence
low-level temporal features learned by efficiently with fast CNN
temporal structure can still be exploitd at the more abstract level of mid-level features
final encoded feature is the average hidden unit activation over the sequence

3. Experiments

3.1. Dataset

CUB
Flowers

3.2. Details

crop
horizontal flip
length of words 30. length of character 201
RMSprop

(CVPR 2016) Learning deep representations of fine-grained visual descriptions

1. Overview

1.1. Motivation

2. Methods

2.1. Embedding

2.1.1. Objective

2.1.2. Inference

2.1.3. Learning

2.2. Text-based ConvNet

2.3. Convolutional Recurrent Net (CNN-RNN)

3. Experiments

3.1. Dataset

3.2. Details

3.3. Comparison

1. Overview

1.1. Motivation

1.2. Related Work

2. Methods

2.1. Embedding

2.1.1. Objective

2.1.2. Inference

2.1.3. Learning

2.2. Text-based ConvNet

2.3. Convolutional Recurrent Net (CNN-RNN)

3. Experiments

3.1. Dataset

3.2. Details

3.3. Comparison